PLOTTING AMCAT DATA FOR EXPLORATORY DATA ANALYSIS (EDA) USING MATPLOTLIB & SEABORN IN PYTHON

GRAPHS: A graph is a diagram showing the relation between variable quantities, typically two variables, each measured along one of a pair of axes at right angles.

Exploratory Data Analysis (EDA) is essentially a type of storytelling for statisticians. It allows us to uncover patterns and insights within data, often with visual methods.

STEP - I: IMPORTING THE NECESSARY LIBRARIES

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

NUMPY: It's a Python library used for working with arrays. It also has functions for working in the domains of linear algebra, Fourier transforms, and matrices. NumPy was created in 2005 by Travis Oliphant. It is an open-source project and you can use it freely. NumPy stands for Numerical Python.

PANDAS: It's a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series.

MATPLOTLIB: It's a plotting library for the Python programming language and its numerical mathematics extension NumPy. It provides an object-oriented API for embedding plots into applications using general-purpose GUI toolkits like Tkinter, wxPython, Qt, or GTK+. SciPy makes use of Matplotlib.

SEABORN: It's a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.

PLOTLY EXPRESS: Plotly Express is a high-level Python visualization library. It is a wrapper for Plotly.py that exposes a simple syntax for complex charts.

STEP - II: IMPORTING THE DATA

In [2]:
df = pd.read_excel(r'C:\Users\admin\Desktop\KANAV BANSAL\KANAV TASKS\TASK - II - EDA - AMCAT DATA/reports.xlsx')
df
Out[2]:
Unnamed: 0 ID Salary DOJ DOL Designation JobCity Gender DOB 10percentage ... ComputerScience MechanicalEngg ElectricalEngg TelecomEngg CivilEngg conscientiousness agreeableness extraversion nueroticism openess_to_experience
0 train 203097 420000 2012-06-01 present senior quality engineer Bangalore f 1990-02-19 84.30 ... -1 -1 -1 -1 -1 0.9737 0.8128 0.5269 1.35490 -0.4455
1 train 579905 500000 2013-09-01 present assistant manager Indore m 1989-10-04 85.40 ... -1 -1 -1 -1 -1 -0.7335 0.3789 1.2396 -0.10760 0.8637
2 train 810601 325000 2014-06-01 present systems engineer Chennai f 1992-08-03 85.00 ... -1 -1 -1 -1 -1 0.2718 1.7109 0.1637 -0.86820 0.6721
3 train 267447 1100000 2011-07-01 present senior software engineer Gurgaon m 1989-12-05 85.60 ... -1 -1 -1 -1 -1 0.0464 0.3448 -0.3440 -0.40780 -0.9194
4 train 343523 200000 2014-03-01 2015-03-01 00:00:00 get Manesar m 1991-02-27 78.00 ... -1 -1 -1 -1 -1 -0.8810 -0.2793 -1.0697 0.09163 -0.1295
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3993 train 47916 280000 2011-10-01 2012-10-01 00:00:00 software engineer New Delhi m 1987-04-15 52.09 ... -1 -1 -1 -1 -1 -0.1082 0.3448 0.2366 0.64980 -0.9194
3994 train 752781 100000 2013-07-01 2013-07-01 00:00:00 technical writer Hyderabad f 1992-08-27 90.00 ... -1 -1 -1 -1 -1 -0.3027 0.8784 0.9322 0.77980 -0.0943
3995 train 355888 320000 2013-07-01 present associate software engineer Bangalore m 1991-07-03 81.86 ... -1 -1 -1 -1 -1 -1.5765 -1.5273 -1.5051 -1.31840 -0.7615
3996 train 947111 200000 2014-07-01 2015-01-01 00:00:00 software developer Asifabadbanglore f 1992-03-20 78.72 ... 438 -1 -1 -1 -1 -0.1590 0.0459 -0.4511 -0.36120 -0.0943
3997 train 324966 400000 2013-02-01 present senior systems engineer Chennai f 1991-02-26 70.60 ... -1 -1 -1 -1 -1 -1.1128 -0.2793 -0.6343 1.32553 -0.6035

3998 rows × 39 columns

Observation:

The dataset has 3998 rows (one per person) and 39 columns, including Gender, Designation, Salary, job location, the 10th, 12th and B.Tech scores, College Tier, Specialization, and more.

STEP - III: DATA ANALYSIS

When include = 'all' is passed to describe(), the result includes the union of the attributes of each column type. The include and exclude parameters can be used to limit which columns in a DataFrame are analyzed for the output. These parameters are ignored when analyzing a Series.

In [3]:
df.describe(include='all')
Out[3]:
Unnamed: 0 ID Salary DOJ DOL Designation JobCity Gender DOB 10percentage ... ComputerScience MechanicalEngg ElectricalEngg TelecomEngg CivilEngg conscientiousness agreeableness extraversion nueroticism openess_to_experience
count 3998 3.998000e+03 3.998000e+03 3998 3998 3998 3998 3998 3998 3998.000000 ... 3998.000000 3998.000000 3998.000000 3998.000000 3998.000000 3998.000000 3998.000000 3998.000000 3998.000000 3998.000000
unique 1 NaN NaN 81 67 419 339 2 1872 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
top train NaN NaN 2014-07-01 00:00:00 present software engineer Bangalore m 1991-01-01 00:00:00 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
freq 3998 NaN NaN 199 1875 539 627 3041 11 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
first NaN NaN NaN 1991-06-01 00:00:00 NaN NaN NaN NaN 1977-10-30 00:00:00 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
last NaN NaN NaN 2015-12-01 00:00:00 NaN NaN NaN NaN 1997-05-27 00:00:00 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
mean NaN 6.637945e+05 3.076998e+05 NaN NaN NaN NaN NaN NaN 77.925443 ... 90.742371 22.974737 16.478739 31.851176 2.683842 -0.037831 0.146496 0.002763 -0.169033 -0.138110
std NaN 3.632182e+05 2.127375e+05 NaN NaN NaN NaN NaN NaN 9.850162 ... 175.273083 98.123311 87.585634 104.852845 36.658505 1.028666 0.941782 0.951471 1.007580 1.008075
min NaN 1.124400e+04 3.500000e+04 NaN NaN NaN NaN NaN NaN 43.000000 ... -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -4.126700 -5.781600 -4.600900 -2.643000 -7.375700
25% NaN 3.342842e+05 1.800000e+05 NaN NaN NaN NaN NaN NaN 71.680000 ... -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -0.713525 -0.287100 -0.604800 -0.868200 -0.669200
50% NaN 6.396000e+05 3.000000e+05 NaN NaN NaN NaN NaN NaN 79.150000 ... -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 0.046400 0.212400 0.091400 -0.234400 -0.094300
75% NaN 9.904800e+05 3.700000e+05 NaN NaN NaN NaN NaN NaN 85.670000 ... -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 0.702700 0.812800 0.672000 0.526200 0.502400
max NaN 1.298275e+06 4.000000e+06 NaN NaN NaN NaN NaN NaN 97.760000 ... 715.000000 623.000000 676.000000 548.000000 516.000000 1.995300 1.904800 2.535400 3.352500 1.822400

13 rows × 39 columns

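Observation:

For every numeric column we get the count, MEAN, STANDARD DEVIATION, MINIMUM, the quartiles (the 50% row is the MEDIAN) and the MAXIMUM. For the non-numeric columns we get the number of unique values, the most frequent value (top) and how often it occurs (freq).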
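
As a small illustration of the summary above, the sketch below runs describe() on a tiny synthetic frame (the values are made up; only the column names are borrowed from the dataset). It also assumes that -1 in the subject-score columns is a placeholder for "no score" rather than a real mark. The notebook does not confirm this, but in Out[3] the min, 25%, 50% and 75% rows of those columns are all -1, and if the assumption holds, their reported means are pulled down by the placeholder.

```python
import pandas as pd

# Synthetic stand-in for df (made-up values, real column names)
toy = pd.DataFrame({
    "Gender": ["m", "f", "m", "m"],
    "Salary": [200000, 300000, 400000, 500000],
    "ComputerScience": [-1, 438, -1, 500],
})

summary_all = toy.describe(include="all")  # text and numeric columns together
summary_num = toy.describe()               # default: numeric columns only

# Assumption: -1 means "no score", so mask it (it becomes NaN) before averaging
cs_scores = toy["ComputerScience"].where(toy["ComputerScience"] != -1)

print(summary_all)
print("mean with -1 kept:  ", toy["ComputerScience"].mean())
print("mean with -1 masked:", cs_scores.mean())
```

Whether masking is appropriate depends on what -1 really encodes in the source data, so it is worth checking the dataset documentation before applying it to df.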

GETTING THE FIRST 5 & LAST 5 ROWS

In [4]:
X = df.head()
X
Out[4]:
Unnamed: 0 ID Salary DOJ DOL Designation JobCity Gender DOB 10percentage ... ComputerScience MechanicalEngg ElectricalEngg TelecomEngg CivilEngg conscientiousness agreeableness extraversion nueroticism openess_to_experience
0 train 203097 420000 2012-06-01 present senior quality engineer Bangalore f 1990-02-19 84.3 ... -1 -1 -1 -1 -1 0.9737 0.8128 0.5269 1.35490 -0.4455
1 train 579905 500000 2013-09-01 present assistant manager Indore m 1989-10-04 85.4 ... -1 -1 -1 -1 -1 -0.7335 0.3789 1.2396 -0.10760 0.8637
2 train 810601 325000 2014-06-01 present systems engineer Chennai f 1992-08-03 85.0 ... -1 -1 -1 -1 -1 0.2718 1.7109 0.1637 -0.86820 0.6721
3 train 267447 1100000 2011-07-01 present senior software engineer Gurgaon m 1989-12-05 85.6 ... -1 -1 -1 -1 -1 0.0464 0.3448 -0.3440 -0.40780 -0.9194
4 train 343523 200000 2014-03-01 2015-03-01 00:00:00 get Manesar m 1991-02-27 78.0 ... -1 -1 -1 -1 -1 -0.8810 -0.2793 -1.0697 0.09163 -0.1295

5 rows × 39 columns

Observation:

There are 3 males & 2 females. The highest Salary Package (1100000) belongs to one of the males, a SENIOR SOFTWARE ENGINEER.

In [5]:
Y = df.tail()
Y
Out[5]:
Unnamed: 0 ID Salary DOJ DOL Designation JobCity Gender DOB 10percentage ... ComputerScience MechanicalEngg ElectricalEngg TelecomEngg CivilEngg conscientiousness agreeableness extraversion nueroticism openess_to_experience
3993 train 47916 280000 2011-10-01 2012-10-01 00:00:00 software engineer New Delhi m 1987-04-15 52.09 ... -1 -1 -1 -1 -1 -0.1082 0.3448 0.2366 0.64980 -0.9194
3994 train 752781 100000 2013-07-01 2013-07-01 00:00:00 technical writer Hyderabad f 1992-08-27 90.00 ... -1 -1 -1 -1 -1 -0.3027 0.8784 0.9322 0.77980 -0.0943
3995 train 355888 320000 2013-07-01 present associate software engineer Bangalore m 1991-07-03 81.86 ... -1 -1 -1 -1 -1 -1.5765 -1.5273 -1.5051 -1.31840 -0.7615
3996 train 947111 200000 2014-07-01 2015-01-01 00:00:00 software developer Asifabadbanglore f 1992-03-20 78.72 ... 438 -1 -1 -1 -1 -0.1590 0.0459 -0.4511 -0.36120 -0.0943
3997 train 324966 400000 2013-02-01 present senior systems engineer Chennai f 1991-02-26 70.60 ... -1 -1 -1 -1 -1 -1.1128 -0.2793 -0.6343 1.32553 -0.6035

5 rows × 39 columns

Observation:

There are 3 females & 2 males. The highest Salary Package (400000) belongs to one of the females, a SENIOR SYSTEMS ENGINEER.

STEP - IV: PLOTTING THE DATA

DISTPLOT: Seaborn's distplot() lets you show a histogram with a density curve drawn over it. We use seaborn in combination with matplotlib, the Python plotting module. It plots a UNIVARIATE distribution of observations. The distplot() function combines the matplotlib hist function with the seaborn kdeplot() and rugplot() functions.

KDE PLOT: Kernel density estimation (KDE) is a non-parametric way to estimate the probability density function of a random variable. It is a fundamental data smoothing problem in which inferences about the population are made based on a finite data sample. A KDE plot can show a univariate or bivariate distribution.

RUG PLOT: A rug plot is a plot of data for a single quantitative variable, displayed as marks along an axis. It is used to visualise the distribution of the data. As such it is analogous to a histogram with zero-width bins, or a one-dimensional scatter plot. It draws a small vertical line for each observation in a distribution.
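
To make the idea of kernel density estimation concrete, here is a minimal sketch that builds a Gaussian KDE by hand with NumPy: one bell-shaped bump is centred on every sample and the bumps are averaged. The sample values are synthetic, and the bandwidth of 3.0 and the helper name gaussian_kde_1d are arbitrary choices made for illustration (seaborn picks a bandwidth automatically).

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average one Gaussian bump per sample, evaluated at each grid point."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # z[i, j] = distance from grid point i to sample j, in bandwidth units
    z = (grid[:, None] - samples[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2.0 * np.pi))
    return bumps.mean(axis=1)

rng = np.random.default_rng(0)
samples = rng.normal(loc=70.0, scale=8.0, size=200)  # synthetic GPA-like values
grid = np.linspace(20.0, 120.0, 1001)
density = gaussian_kde_1d(samples, grid, bandwidth=3.0)

# A density should integrate to roughly 1 over a wide enough grid
area = float(np.sum(density) * (grid[1] - grid[0]))
print(round(area, 3))
```

A smaller bandwidth gives a bumpier curve that follows individual samples, while a larger one gives a smoother curve that can hide real structure.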

In [6]:
sns.distplot(df['Salary'])
Out[6]:
<matplotlib.axes._subplots.AxesSubplot at 0x20fb94bc2b0>
In [7]:
sns.distplot(df['Salary'], kde = False, rug = True)
Out[7]:
<matplotlib.axes._subplots.AxesSubplot at 0x20fb9899610>
In [8]:
sns.distplot(df['Salary'], kde = False, rug = False)
Out[8]:
<matplotlib.axes._subplots.AxesSubplot at 0x20fba170970>
In [9]:
sns.distplot(df['collegeGPA'], kde = False, rug = True)
Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x20fba22ccd0>
In [10]:
sns.distplot(df['Logical'], kde = False, rug = True)
Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x20fb95919a0>

Observation:

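  1. distplot() is an axes-level function: as the outputs above show, it returns a matplotlib Axes. Its figure-level counterpart in newer seaborn versions is displot(), which offers similar flexibility over the kind of plot to draw.

  2. It's meant for a univariate set of observations, i.e. a single variable, and visualizes it through a histogram; hence we choose one particular column of the dataset.

  3. Tutorials often demonstrate it on random values created with np.random.randn(); here the plots show the distribution of real columns (Salary, collegeGPA and Logical).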

POINT PLOT: A point plot represents an estimate of central tendency for a numeric variable by the position of scatter plot points and provides some indication of the uncertainty around that estimate using error bars.

CATPLOT: It's a relatively new addition to Seaborn that simplifies plotting that involves categorical variables. It combines a categorical plot with a FacetGrid: catplot() can draw all the categorical plot types, and one specifies the type needed with the kind parameter. The default kind in catplot() is “strip”, corresponding to stripplot().

In [11]:
sns.catplot(x = "Gender", y = "Salary", hue = "CollegeTier", kind = "point", data = df)
Out[11]:
<seaborn.axisgrid.FacetGrid at 0x20fb95fe550>
In [12]:
sns.catplot(x = "Gender", y = "Salary", hue = "Degree", kind = "point", data = df)
Out[12]:
<seaborn.axisgrid.FacetGrid at 0x20fb9198940>
In [13]:
sns.catplot(x="collegeGPA", y="Salary", hue="Gender", markers=["^", "o"], linestyles=["-", "--"], kind="point", data = X)
Out[13]:
<seaborn.axisgrid.FacetGrid at 0x20fba5d1a90>
In [14]:
sns.catplot(x="collegeGPA", y="Salary", hue="Gender", markers=["^", "o"], linestyles=["-", "--"], kind="point", data = Y)
Out[14]:
<seaborn.axisgrid.FacetGrid at 0x20fba2c6c40>
In [15]:
sns.catplot(x="collegeGPA", y="Salary", hue="Gender", markers=["^", "o"], linestyles=["-", "--"], kind="point", data=df)
Out[15]:
<seaborn.axisgrid.FacetGrid at 0x20fba53cfd0>
In [16]:
sns.catplot(x="Degree", y="Salary", hue="Gender", markers=["^", "o"], linestyles=["-", "--"], kind="point", data = df)
Out[16]:
<seaborn.axisgrid.FacetGrid at 0x20fb91b01f0>

Observation:

  1. Point plots can be more useful than bar plots for focusing comparisons between different levels of one or more categorical variables.

  2. They are particularly adept at showing interactions: how the relationship between levels of one categorical variable changes across levels of a second categorical variable. The lines that join each point from the same hue level allow interactions to be judged by differences in slope, which is easier for the eyes than comparing the heights of several groups of points or bars.

  3. It is important to keep in mind that a point plot shows only the mean (or other estimator) value, but in many cases it may be more informative to show the distribution of values at each level of the categorical variables. In that case, other approaches such as a box or violin plot may be more appropriate.
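
To see what a point plot actually encodes, the heights of the points can be reproduced with a plain pandas groupby. The sketch below mirrors the layout of In [11] (x = Gender, hue = CollegeTier) on a small made-up frame. The standard error is used only as a rough stand-in for the error bars, since seaborn draws bootstrapped confidence intervals by default.

```python
import pandas as pd

# Made-up sample with the same column names as the dataset
toy = pd.DataFrame({
    "Gender": ["m", "m", "m", "f", "f", "f", "m", "f"],
    "CollegeTier": [1, 2, 2, 1, 2, 2, 1, 1],
    "Salary": [500000, 300000, 320000, 450000, 280000, 300000, 540000, 470000],
})

grouped = toy.groupby(["CollegeTier", "Gender"])["Salary"]

# One row per hue level (CollegeTier), one column per x level (Gender):
# these means are the heights of the plotted points
point_heights = grouped.mean().unstack("Gender")

# Standard error of the mean, a rough proxy for the error bars
point_sem = grouped.sem().unstack("Gender")

print(point_heights)
print(point_sem)
```

Comparing the rows of point_heights is the numeric equivalent of comparing the slopes of the lines in the plot.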

VIOLIN PLOT: A violin plot combines a boxplot with a kernel density estimate and plays a similar role to a box-and-whisker plot. It shows the distribution of quantitative data across several levels of one (or more) categorical variables such that those distributions can be compared.

In [17]:
sns.violinplot(x = df.CollegeTier, y = df.collegeGPA)
Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x20fbb7fa2b0>
In [18]:
sns.violinplot(x = df.Degree, y = df.collegeGPA)
Out[18]:
<matplotlib.axes._subplots.AxesSubplot at 0x20fbba6b640>
In [19]:
sns.catplot(x= "Degree", y= "collegeGPA", hue= "Gender", kind= "violin", inner= "stick", split=True, palette="pastel", data= df)
Out[19]:
<seaborn.axisgrid.FacetGrid at 0x20fbba492e0>
In [20]:
sns.catplot(x = "Degree", y = "collegeGPA", hue = "Gender", kind = "violin", split = True, data = df)
Out[20]:
<seaborn.axisgrid.FacetGrid at 0x20fba4389d0>
In [21]:
sns.catplot(x = "collegeGPA", y = "Degree", hue = "Gender", kind = "violin", bw = .15, cut = 0, data = df)
Out[21]:
<seaborn.axisgrid.FacetGrid at 0x20fba4294f0>
In [22]:
sns.catplot(x = "Degree", y = "collegeGPA", hue = "Gender", kind = "violin", data = df)
Out[22]:
<seaborn.axisgrid.FacetGrid at 0x20fc357aaf0>

Observation:

  1. The white dot represents the median, the thick gray bar in the center represents the interquartile range, the thin gray line represents the rest of the distribution, except for points that are determined to be “outliers” using a method that is a function of the interquartile range.

  2. Violin plots are drawn vertically most of the time. If you have long labels, building a horizontal version, like Out[21] above, makes the labels more readable.

  3. If the variables are grouped, we can build a grouped violin plot as you would do for a boxplot.

BOXPLOT: A boxplot is a very basic plot used to visualize distributions. It's very useful when you want to compare data between groups. A boxplot is sometimes called a box-and-whisker plot. The box shows the quartiles of the dataset while the whiskers extend to show the rest of the distribution.

In [23]:
sns.catplot(x = "Degree", y = "Salary", kind = "boxen", data = df.sort_values("collegeGPA"))
Out[23]:
<seaborn.axisgrid.FacetGrid at 0x20fc374afd0>
In [24]:
sns.catplot(x = "Degree", y = "Salary", hue = "Gender", kind = "box", data = df)
Out[24]:
<seaborn.axisgrid.FacetGrid at 0x20fc35f35e0>
In [25]:
sns.boxplot(data = X, x='Specialization', y='Salary')
Out[25]:
<matplotlib.axes._subplots.AxesSubplot at 0x20fc379c700>
In [26]:
sns.boxplot(data = df, x = 'Specialization', y = 'Salary')
Out[26]:
<matplotlib.axes._subplots.AxesSubplot at 0x20fc38d2a00>
In [27]:
sns.boxplot(data = df, x = 'Designation', y = 'Salary')
Out[27]:
<matplotlib.axes._subplots.AxesSubplot at 0x20fc3d58b50>
In [28]:
sns.boxplot(data = df, x = 'GraduationYear', y = 'Salary')
Out[28]:
<matplotlib.axes._subplots.AxesSubplot at 0x20fc6e98280>
In [29]:
sns.boxplot(data = df, x = 'CollegeTier', y = 'Salary')
Out[29]:
<matplotlib.axes._subplots.AxesSubplot at 0x20fc6ed5970>
In [30]:
sns.boxplot(data = df, x = 'CollegeTier', y = 'Degree')
Out[30]:
<matplotlib.axes._subplots.AxesSubplot at 0x20fc70062b0>
In [31]:
sns.boxplot(data = X, x = 'collegeGPA', y = 'Designation')
Out[31]:
<matplotlib.axes._subplots.AxesSubplot at 0x20fc704d2e0>
In [32]:
sns.boxplot(data = df, x = 'Degree', y = 'Salary')
Out[32]:
<matplotlib.axes._subplots.AxesSubplot at 0x20fc7137850>
In [33]:
sns.boxplot(data = X, x = 'Salary', y = 'Designation', hue = 'Specialization')
Out[33]:
<matplotlib.axes._subplots.AxesSubplot at 0x20fc8197a60>
In [34]:
sns.boxplot(data = Y, x = 'Specialization', y = 'Salary', hue = 'Designation')
Out[34]:
<matplotlib.axes._subplots.AxesSubplot at 0x20fc82473a0>
In [35]:
sns.catplot(x = "English", y = "Logical", kind = "boxen", data = df.sort_values("Gender"))
Out[35]:
<seaborn.axisgrid.FacetGrid at 0x20fc6ed5940>
In [36]:
plt.boxplot(df['Salary'])
plt.show()
In [37]:
plt.boxplot(df['collegeGPA'])
plt.show()

Observation:

  1. OUTLIERS: An outlier is an observation that is numerically distant from the rest of the data. Box plots are useful as they show outliers within a data set. When reviewing a box plot, an outlier is defined as a data point that is located outside the whiskers of the box plot.

  2. Box plots allow comparison of the medians, the interquartile ranges and the whiskers across groups.

  3. They reveal potential outliers and signs of skewness.
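
The outlier rule behind the whiskers can be written out in a few lines. The sketch below applies the common 1.5 x IQR rule, which is the default whisker setting in matplotlib and seaborn box plots, to a made-up salary sample. The helper name iqr_outliers is hypothetical.

```python
import numpy as np

def iqr_outliers(values, whis=1.5):
    """Return the points beyond whis * IQR from the quartiles, plus the fences."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - whis * iqr, q3 + whis * iqr
    mask = (values < lower) | (values > upper)
    return values[mask], (lower, upper)

# Made-up sample with one very large salary
salaries = [180000, 200000, 300000, 320000, 370000, 400000, 4000000]
outliers, fences = iqr_outliers(salaries)
print(outliers, fences)
```

Points flagged this way are only candidates for inspection; a very high salary may be a genuine observation rather than an error.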

SWARM PLOT: This style of plot is sometimes called a “beeswarm”. A swarm plot can be drawn on its own, but it is also a good complement to a box or violin plot in cases where you want to show all observations along with some representation of the underlying distribution.

In [38]:
sns.catplot(x = "Gender", y = "Salary", order = ["m", "f"], data = df)
Out[38]:
<seaborn.axisgrid.FacetGrid at 0x20fc81ef7c0>
In [39]:
sns.catplot(x = "English", y = "Quant", hue = "Gender", kind = "swarm", data = df)
Out[39]:
<seaborn.axisgrid.FacetGrid at 0x20fc8840a60>
In [40]:
sns.catplot(x = "GraduationYear", y = "collegeGPA", data = df)
Out[40]:
<seaborn.axisgrid.FacetGrid at 0x20fc6b38ca0>
In [41]:
sns.catplot(x = "Degree", y = "Salary", hue = "Gender", kind = "swarm", data = df)
Out[41]:
<seaborn.axisgrid.FacetGrid at 0x20fc88a7c40>
In [42]:
sns.catplot(x = "CollegeTier", y = "collegeGPA", kind = "swarm", data = df)
Out[42]:
<seaborn.axisgrid.FacetGrid at 0x20fc9dbb8b0>
In [43]:
sns.swarmplot(data = df, x = 'Specialization', y = 'Salary')
Out[43]:
<matplotlib.axes._subplots.AxesSubplot at 0x20fc9bb8b80>
In [44]:
sns.swarmplot(data = df, x = 'Domain', y = 'English')
Out[44]:
<matplotlib.axes._subplots.AxesSubplot at 0x20fc9cbb700>
In [45]:
sns.swarmplot(data = df, x = 'Salary', y = 'Domain')
Out[45]:
<matplotlib.axes._subplots.AxesSubplot at 0x20fce0a50a0>

Observation:

  1. It can give a better representation of the distribution of observations, although it only works well for relatively small datasets.

  2. The plot can be enlarged, and points can be separated by hue using the argument dodge = True (named split in old seaborn versions).

  3. The legend can be placed to the right, and the y-axis limits can be adjusted so that the axis is anchored at 0.

SCATTER PLOT: A scatter plot draws a two-dimensional graphic that can be enhanced by mapping up to three additional variables using the hue, size, and style parameters. Using redundant semantics (for example, both hue and style for the same variable) can be helpful for making graphics more accessible.

JOINT PLOT: It displays the relationship between 2 variables (bivariate) as well as their 1D profiles (univariate) in the margins. jointplot() is a convenience function that wraps the JointGrid class.

In [46]:
sns.jointplot(x = 'Salary', y = 'collegeGPA', data = df, kind = 'scatter')
Out[46]:
<seaborn.axisgrid.JointGrid at 0x20fce0d7a00>
In [47]:
sns.jointplot(x = 'collegeGPA', y = 'Salary', data = df, kind = 'scatter')
Out[47]:
<seaborn.axisgrid.JointGrid at 0x20fce4e23a0>
In [48]:
sns.jointplot(x = 'English', y = 'Logical', data = df, kind = 'scatter')
Out[48]:
<seaborn.axisgrid.JointGrid at 0x20fce66c580>
In [49]:
sns.jointplot(x = 'English', y = 'collegeGPA', data = df, kind = 'scatter')
Out[49]:
<seaborn.axisgrid.JointGrid at 0x20fce953b80>
In [50]:
sns.jointplot(x='Logical', y='Quant', data=df, kind = 'scatter')
Out[50]:
<seaborn.axisgrid.JointGrid at 0x20fceb0f1c0>
In [51]:
plt.scatter(df['Salary'], df['collegeGPA'])
plt.show()
In [52]:
plt.scatter(df['Salary'], df['Specialization'])
plt.show()
In [53]:
plt.scatter(df['Gender'], df['Salary'])
plt.show()
In [54]:
plt.scatter(df['English'], df['Domain'])
plt.show()
In [55]:
plt.scatter(df['Quant'], df['Logical'])
plt.show()
In [56]:
plt.scatter(df['Degree'], df['CollegeState'])
plt.show()
In [57]:
plt.scatter(df['Degree'], df['CollegeTier'])
plt.show()
In [58]:
plt.scatter(df['collegeGPA'], df['Gender'])
plt.show()

Observation:

SCATTER PLOT:

  1. Each point represents a pair of numerical values.

  2. The dependent variable may take multiple values for each value of the independent variable.

  3. A scatter plot helps determine whether there is a relationship between two variables, but it only shows correlation, not causation.

  4. Discrete data is best suited to pass/fail measurements, whereas continuous data lets you measure on an infinite set of values and is what scatter analysis generally uses.

JOINT PLOT:

  1. From the output, you can see that a joint plot has three parts.

  2. A distribution plot at the top for the column on the x-axis, a distribution plot on the right for the column on the y-axis and a scatter plot in between that shows the mutual distribution of data for both the columns.

  3. You can see that there is no correlation observed between the x, y variables as given in the input.

  4. You can change the type of the joint plot by passing a value for the kind parameter.
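
A visual impression of "correlation" or "no correlation" can be backed up with a number. The sketch below computes Pearson's r with pandas for one pair of columns that is built to be related and one that is not. All values are synthetic, and the column name Noise is made up for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 500
toy = pd.DataFrame({"Logical": rng.normal(500, 90, n)})
toy["Quant"] = 0.6 * toy["Logical"] + rng.normal(200, 80, n)  # built to follow Logical
toy["Noise"] = rng.normal(0, 1, n)                            # unrelated to both

r_related = toy["Logical"].corr(toy["Quant"])    # Pearson r is the default method
r_unrelated = toy["Logical"].corr(toy["Noise"])
print(round(r_related, 2), round(r_unrelated, 2))
```

On the real data, df['Logical'].corr(df['Quant']) would give the figure that corresponds to the joint plot in In [50].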

HEXBIN PLOT: A hexbin plot is useful to represent the relationship between 2 numerical variables when you have a lot of data points. Instead of letting points overlap, the plotting window is split into several hexagonal bins (hexbins), and the number of points per hexbin is counted. The color denotes this number of points.

In [59]:
sns.jointplot(x = 'collegeGPA', y = 'Salary', data = df, kind = 'hex', color = 'k')
Out[59]:
<seaborn.axisgrid.JointGrid at 0x20fcef6aac0>
In [60]:
sns.jointplot(x = 'English', y = 'Quant', data = df, kind = 'hex', color = 'b')
Out[60]:
<seaborn.axisgrid.JointGrid at 0x20fcf0311c0>
In [61]:
sns.jointplot(x = 'CollegeTier', y = 'Domain', data = X, kind = 'hex', color = 'b')
Out[61]:
<seaborn.axisgrid.JointGrid at 0x20fcf207490>
In [62]:
sns.jointplot(x = 'Quant', y = 'Logical', data = df, kind = 'hex', color = 'b')
Out[62]:
<seaborn.axisgrid.JointGrid at 0x20fcf3ed370>
In [63]:
sns.jointplot(x = 'CollegeID', y = 'Salary', data = df, kind = 'hex', color = 'b')
Out[63]:
<seaborn.axisgrid.JointGrid at 0x20fcf4b3e80>

Observation:

  1. Instead of drawing overlapping points, the plotting window is split into several hexbins, and the number of points per hexbin is counted.

  2. The color denotes this number of points.

  3. When the size of the hexagons changes, the scale of the color bar is redefined accordingly. We can change the size of the bins using the gridsize argument.

  4. We get a clear picture of density, distributions, and relative ranges, similar to a heat map.

  5. The shape of the hexagon allows us to limit the effects of edge biases found in square bins, while retaining the ability to form a continuous grid.
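
The effect of gridsize on the bins and on the color scale can be checked directly with matplotlib's hexbin(), which is what seaborn uses for kind = 'hex'. This is a minimal sketch on synthetic data; the gridsize values of 10 and 40 are arbitrary.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(70, 8, 2000)           # synthetic GPA-like values
y = rng.normal(300000, 100000, 2000)  # synthetic salary-like values

fig, (ax_coarse, ax_fine) = plt.subplots(1, 2, figsize=(10, 4))
coarse = ax_coarse.hexbin(x, y, gridsize=10, cmap="Blues")  # few, large hexagons
fine = ax_fine.hexbin(x, y, gridsize=40, cmap="Blues")      # many, small hexagons
fig.colorbar(coarse, ax=ax_coarse, label="points per hexagon")
fig.colorbar(fine, ax=ax_fine, label="points per hexagon")

# get_array() holds the per-hexagon counts that the colors encode
coarse_counts = coarse.get_array()
fine_counts = fine.get_array()
print(coarse_counts.max(), fine_counts.max())
```

Larger hexagons collect more points each, so the top of the color scale is higher for the coarse grid than for the fine one.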

PAIR PLOT: Pair plots are a really simple (one-line-of-code simple!) way to visualize the relationships between each pair of variables. A pair plot produces a matrix of relationships between the variables in your data for an instant examination of the data. It can also be a great jumping-off point for determining which types of regression analysis to use.

In [64]:
sns.pairplot(df)
Out[64]:
<seaborn.axisgrid.PairGrid at 0x20fcf6a8c10>
In [65]:
sns.pairplot(X)
Out[65]:
<seaborn.axisgrid.PairGrid at 0x20feb28c4f0>
In [66]:
sns.pairplot(Y)
Out[66]:
<seaborn.axisgrid.PairGrid at 0x20f8d689fd0>

Observation:

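  1. Pair plots are a powerful tool to quickly explore distributions and relationships in a dataset.

  2. By default, the diagonal plots show each variable's own distribution (histograms when no hue is given, as in the calls above; kernel density plots when hue is set), while the other plots are scatter plots.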

STRIP PLOT: A strip plot is a graphical data analysis technique for summarizing a univariate data set. In its basic form, one axis (typically the horizontal one) shows the value of the response variable, and each observation is drawn as a mark along it. It is typically used for small data sets (histograms and density plots are typically preferred for larger data sets).

In [67]:
sns.stripplot(data = df, x = 'Salary', y ='collegeGPA')
Out[67]:
<matplotlib.axes._subplots.AxesSubplot at 0x20fad838f70>
In [68]:
sns.stripplot(data = df, x = 'Specialization', y = 'Logical')
Out[68]:
<matplotlib.axes._subplots.AxesSubplot at 0x20fae49a7c0>
In [69]:
sns.stripplot(data = df, x = 'CollegeTier', y='Specialization')
Out[69]:
<matplotlib.axes._subplots.AxesSubplot at 0x20faea5a910>
In [70]:
sns.stripplot(data = df, x = 'English', y='Domain')
Out[70]:
<matplotlib.axes._subplots.AxesSubplot at 0x20faec6e100>
In [71]:
sns.stripplot(data = df, x = 'Logical', y='Quant')
Out[71]:
<matplotlib.axes._subplots.AxesSubplot at 0x20faeeb4e80>

Observation:

  1. Some things to keep an eye out for when looking at data on a numeric variable: (a) Skewness and Multimodality can be seen, but other visualizations show these more clearly. (b) Gaps and Outliers can be revealed and data outside of the expected range. (c) rounding, e.g. to integer values, or heaping, i.e. a few particular values occur very frequently. (d) impossible or suspicious values.

  2. Scalability in this form is limited due to over-plotting; a plain strip plot can display up to roughly 30,000 data points.

  3. With a good combination of point size, jittering, and alpha blending, the strip plot for groups of data can scale to several hundred thousand observations and ten to twenty groups.

  4. Storage needed for vector graphics images grows linearly with the number of observations.

BAR PLOT: A bar chart or graph presents categorical data with rectangular bars whose heights or lengths are proportional to the values that they represent. The bars can be plotted vertically or horizontally. A vertical bar chart is sometimes called a column chart. The size of each bar represents a numeric value (in seaborn's barplot(), by default the mean of the numeric variable within each category, with an error bar), and bars can be grouped to display several levels of grouping.

In [72]:
sns.barplot(x = 'Gender', y = 'Salary', data = df)
Out[72]:
<matplotlib.axes._subplots.AxesSubplot at 0x20faf0926a0>
In [73]:
sns.barplot(x = 'collegeGPA', y = 'Degree', data = df)
Out[73]:
<matplotlib.axes._subplots.AxesSubplot at 0x20faf112be0>
In [74]:
sns.barplot(x = 'Salary', y = 'collegeGPA', hue = 'Degree', data = df)
Out[74]:
<matplotlib.axes._subplots.AxesSubplot at 0x20faf21e970>
In [75]:
sns.barplot(x = 'CollegeTier', y = 'collegeGPA', hue = 'Degree', data = df)
Out[75]:
<matplotlib.axes._subplots.AxesSubplot at 0x210074d85e0>
In [76]:
sns.barplot(x = 'English', y = 'Degree', hue = 'Gender', data = df)
Out[76]:
<matplotlib.axes._subplots.AxesSubplot at 0x210075bc2e0>

Observation:

  1. They are created to show data in multiple, highly visual ways. The length of the bars/columns indicates the value, as read off the value axis (the y-axis for vertical bars).

  2. Bar graphs have an x- and y-axis and can be used to showcase one, two, or many categories of data. Single and dual bar charts are commonly used to show the total size of groups.

  3. They show how the proportions between groups relate to each other, in addition to the total of each group.

  4. The columns can contain multiple labeled variables (or just one), or they can be grouped together (or not) for comparative purposes.

COUNT PLOT: A count plot can be thought of as a histogram across a categorical, instead of quantitative, variable. The basic API and options are identical to those for barplot() , so you can compare counts across nested variables.

In [77]:
sns.countplot(x = 'Salary', data = df, palette = 'rainbow')
Out[77]:
<matplotlib.axes._subplots.AxesSubplot at 0x210076359d0>
In [78]:
sns.countplot(x = 'Salary', data = X, palette = 'rainbow')
Out[78]:
<matplotlib.axes._subplots.AxesSubplot at 0x21007603100>
In [79]:
sns.countplot(x = 'CollegeTier', data = df, palette = 'gist_earth')
Out[79]:
<matplotlib.axes._subplots.AxesSubplot at 0x2100898fd30>
In [80]:
sns.countplot(x = 'English', data = df, palette = 'autumn')
Out[80]:
<matplotlib.axes._subplots.AxesSubplot at 0x21008a18730>
In [81]:
sns.countplot(x= 'Gender', data = df, palette = 'Paired')
Out[81]:
<matplotlib.axes._subplots.AxesSubplot at 0x21008984460>
In [82]:
sns.countplot(x = 'JobCity', data = Y, palette ='twilight')
Out[82]:
<matplotlib.axes._subplots.AxesSubplot at 0x21008c65d90>
In [83]:
sns.countplot(x = 'collegeGPA', data = X, palette = 'cubehelix')
Out[83]:
<matplotlib.axes._subplots.AxesSubplot at 0x21008cd55e0>
In [84]:
sns.countplot(x = 'Degree', data = df, palette = 'Purples')
Out[84]:
<matplotlib.axes._subplots.AxesSubplot at 0x21008d30af0>
In [85]:
sns.countplot(x = 'Designation', data = Y, palette = 'cubehelix')
Out[85]:
<matplotlib.axes._subplots.AxesSubplot at 0x21008d5c340>

Observation:

  1. Input data can be passed as vectors of data represented as lists, numpy arrays, or pandas Series objects, given directly to the x, y, and/or hue parameters.

  2. In most cases, it is possible to use numpy or Python objects, but pandas objects are preferable because the associated names will be used to annotate the axes. Additionally, we can use Categorical types for the grouping variables to control the order of plot elements.

  3. Data can also be passed as a "long-form" DataFrame, in which case the x, y, and hue variables determine how the data are plotted.

  4. A “wide-form” DataFrame can be passed as well, such that each numeric column is plotted.
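
The bar heights of a count plot are simply the frequencies of each category, so they can be checked with value_counts(). The sketch below uses a made-up Gender sample and also shows how a Categorical type fixes the order of the categories, as mentioned in point 2.

```python
import pandas as pd

gender = pd.Series(["m", "f", "m", "m", "f", "m"], name="Gender")  # made-up sample

bar_heights = gender.value_counts()         # what each bar's height encodes
shares = gender.value_counts(normalize=True)

# A Categorical dtype controls the order in which the categories appear
ordered = gender.astype(pd.CategoricalDtype(categories=["f", "m"], ordered=True))
counts_in_order = ordered.value_counts().sort_index()

print(bar_heights)
print(counts_in_order)
```

For a column with many distinct values, such as Salary in In [77], almost every bar has a height of 1 or 2, which is a hint that a histogram would suit that column better.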

HISTOGRAM: A histogram is a graphical representation of the distribution of numerical data. The x-axis of the histogram is divided into bins (ranges of values), while the y-axis represents the frequency of observations falling in each bin.

In [86]:
plt.hist(df['Salary'])
plt.show()
In [87]:
plt.hist(df['Designation'])
plt.show()
In [88]:
plt.hist(df['Gender'])
plt.show()
In [89]:
plt.hist(df['Specialization'])
plt.show()
In [90]:
plt.hist(df['Degree'])
plt.show()

Observation:

  1. Creating a histogram provides a visual representation of data distribution.

  2. Histograms can display a large amount of data and the frequency of the data values, from which the approximate median and the shape of the distribution can be judged. In addition, they can show any outliers or gaps in the data.

  3. Plotting tools choose a default number of bins (plt.hist() uses 10 unless told otherwise). You may override this selection and set your own bin specification with the bins argument.
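
How the bin specification changes the picture can be seen without plotting, using np.histogram(), which returns the counts and the bin edges that a histogram would draw. The salary values below are made up, and the three bin choices are only examples.

```python
import numpy as np

salaries = np.array([100000, 180000, 200000, 280000, 300000,
                     320000, 325000, 400000, 420000, 500000])  # made-up sample

# Default: 10 equal-width bins between the minimum and the maximum
counts_default, edges_default = np.histogram(salaries)

# Explicit number of equal-width bins
counts_4, edges_4 = np.histogram(salaries, bins=4)

# Explicit bin edges, here three 200000-wide salary bands
counts_custom, edges_custom = np.histogram(salaries, bins=[0, 200000, 400000, 600000])

print(counts_4, edges_4)
print(counts_custom, edges_custom)
```

The same bins values can be passed to plt.hist(), for example plt.hist(df['Salary'], bins=30).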

PIE CHART: A pie chart is a circular statistical plot that can display only one series of data. The whole circle represents 100% of the given data, and the area of each slice represents that part's percentage of the total. The slices of the pie are called wedges.

In [91]:
px.pie(df, names = 'Degree', values = 'collegeGPA')
In [92]:
px.pie(df, names = 'GraduationYear', values = 'English')
In [93]:
px.pie(df, names = 'Gender', values = 'GraduationYear')
In [94]:
px.pie(df, names = 'Specialization', values = 'CollegeTier')

Observation:

  1. When you interpret a single pie chart, look at the differences in the size of the slices.

  2. The size of a slice shows the proportion of observations that are in that group.

  3. When you compare multiple pie charts, look at the differences in the size of the slices for the same categories across the charts.
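
One detail worth checking when reading the pies above: when a values column is given, Plotly sums that column over the rows sharing the same label, so a slice in In [91] reflects the summed collegeGPA per Degree rather than the number of people. The sketch below contrasts the two kinds of share with pandas on a made-up frame (the Degree labels are illustrative).

```python
import pandas as pd

# Made-up sample; the Degree labels are illustrative
toy = pd.DataFrame({
    "Degree": ["B.Tech/B.E.", "B.Tech/B.E.", "B.Tech/B.E.", "MCA"],
    "collegeGPA": [70.0, 80.0, 90.0, 60.0],
})

# Share of people per degree: the slice sizes if each row counts once
count_share = toy["Degree"].value_counts(normalize=True)

# Share of summed GPA per degree: the slice sizes when values = 'collegeGPA'
gpa_sum = toy.groupby("Degree")["collegeGPA"].sum()
gpa_share = gpa_sum / gpa_sum.sum()

print(count_share)
print(gpa_share)
```

If the goal is to show how many people hold each degree, leaving out values (or passing a column of ones) gives slices based on counts.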

HEATMAP: A heatmap (aka heat map) depicts values for a main variable of interest across two axis variables as a grid of colored squares. The axis variables are divided into ranges like a bar chart or histogram, and each cell's color indicates the value of the main variable in the corresponding cell range.

CORRELATION HEATMAP: A correlation heatmap uses colored cells, typically in a monochromatic scale, to show a 2D correlation matrix (table) between two discrete dimensions or event types. The values of the first dimension appear as the rows of the table, while the values of the second dimension are represented by the columns of the table.

CORRELATION BETWEEN DIFFERENT COLUMNS

In [95]:
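# Keep only numeric columns; newer pandas versions raise an error on text columns in corr()
corr = df.select_dtypes(include = 'number').corr(method = 'kendall')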
plt.figure(figsize = (15,8))
sns.heatmap(corr, annot = True)
df.columns
Out[95]:
Index(['Unnamed: 0', 'ID', 'Salary', 'DOJ', 'DOL', 'Designation', 'JobCity',
       'Gender', 'DOB', '10percentage', '10board', '12graduation',
       '12percentage', '12board', 'CollegeID', 'CollegeTier', 'Degree',
       'Specialization', 'collegeGPA', 'CollegeCityID', 'CollegeCityTier',
       'CollegeState', 'GraduationYear', 'English', 'Logical', 'Quant',
       'Domain', 'ComputerProgramming', 'ElectronicsAndSemicon',
       'ComputerScience', 'MechanicalEngg', 'ElectricalEngg', 'TelecomEngg',
       'CivilEngg', 'conscientiousness', 'agreeableness', 'extraversion',
       'nueroticism', 'openess_to_experience'],
      dtype='object')

Observation:

  1. Feature-expression heat maps provide insight into complex associations.

  2. It utilizes effect ordered data display on two variable sets.

  3. Its applications are found in complex biological systems.

  4. In the heatmap above, each cell's color encodes the Kendall correlation coefficient between a pair of columns, and annot = True prints that coefficient inside the cell.
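
Since the heatmap is built with method = 'kendall', it may help to see what that coefficient measures. The sketch below is a plain-Python version of Kendall's tau-a: it counts the pairs of observations that are ordered the same way in both variables (concordant) and those ordered oppositely (discordant). It is a simplified illustration, not the routine pandas uses; the library computation also corrects for ties, and the two agree when there are no tied values. The function name kendall_tau_a and the sample values are made up.

```python
from itertools import combinations

def kendall_tau_a(x, y):
    """(concordant - discordant) / total pairs; ignores any tie correction."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        direction = (x1 - x2) * (y1 - y2)
        if direction > 0:
            concordant += 1   # both variables move in the same direction
        elif direction < 0:
            discordant += 1   # the variables move in opposite directions
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

quant = [1, 2, 3, 4, 5]      # made-up ranks
logical = [2, 1, 4, 3, 5]
tau = kendall_tau_a(quant, logical)
print(tau)
```

A value near 1 means the two columns rank the observations almost identically, a value near -1 means the rankings are reversed, and a value near 0 means there is little monotonic association.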
